What is ggplot2?

The transferrable skills from ggplot2 are not the idiosyncracies of plotting syntax, but a powerful way of thinking about visualisation, as a way of mapping between variables and the visual properties of geometric objects that you can perceive.
Hadley Wickham

Source: http://disq.us/p/sv640d

  • gg is for Grammar of Graphics
  • ggplot2 is a huge package: philosophy + functions

Getting started

Easy: install the tidyverse

install.packages('tidyverse')

Medium: install just ggplot2

install.packages('ggplot2')

Expert: install from GitHub (latest development version)

devtools::install_github('tidyverse/ggplot2')

Load the tidyverse

library(tidyverse)

Other packages for this tutorial

We’ll use an excerpt of the gapminder dataset provided by the gapminder package by Jenny Bryan.

# uncomment the next line to install {gapminder} package if not installed yet
# install.packages("gapminder")
library(gapminder)

Concepts of ggplot2

How do we express visuals in words?

  • Data to be visualized

  • Aesthetic mappings from data to visual component

  • Geometric objects that appear on the plot

  • Facets group into subplots

  • Coordinates organize location of geometric objects

  • Scales define the range of values for aesthetics

  • Statistics transform data on the way to visualization

Tidy Data

Data

ggplot(data)

Tidy Data

  1. Each variable forms a column

  2. Each observation forms a row

  3. Each observational unit forms a table

Key

  1. What information do I want to use in my visualization?

  2. Is that data contained in one column/row for a given data point?

country              1997        2002        2007
Canada           30.30584    31.90227    33.39014
China          1230.07500  1280.40000  1318.68310
United States   272.91176   287.67553   301.13995

tidy_pop <- gather(messy_pop, 'year', 'pop', -country)

country        year         pop
Canada         1997    30.30584
China          1997  1230.07500
United States  1997   272.91176
Canada         2002    31.90227
China          2002  1280.40000
United States  2002   287.67553
Canada         2007    33.39014
China          2007  1318.68310
United States  2007   301.13995
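In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below reshapes the same table that way. The messy_pop data frame is rebuilt here from the values shown above so the example runs on its own, and the row order of the result differs from the gather() output (rows are grouped by country rather than by year).

```r
library(tidyr)

# messy_pop rebuilt from the table above (population in millions);
# check.names = FALSE keeps the year column names as "1997", "2002", "2007"
messy_pop <- data.frame(
  country = c("Canada", "China", "United States"),
  `1997`  = c(30.30584, 1230.07500, 272.91176),
  `2002`  = c(31.90227, 1280.40000, 287.67553),
  `2007`  = c(33.39014, 1318.68310, 301.13995),
  check.names = FALSE
)

# Every column except country becomes a (year, pop) pair
tidy_pop <- pivot_longer(
  messy_pop,
  cols      = -country,
  names_to  = "year",
  values_to = "pop"
)
```

As with gather(), year ends up as a character column because it comes from the column names.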

Aesthetic

Data
Aesthetic Mapping

+ aes()

Mapping

Map data to visual elements or parameters

  • year → x

  • pop → y

  • country → shape, color, etc.

aes(
  x = year,
  y = pop,
  color = country
)

Geometric

Data
Aesthetic Mapping
Geometric Objects

+ geom_*()

Geometric Objects

Geometric objects displayed on the plot


Here are some of the most widely used geoms

Type               Function
Point              geom_point()
Line               geom_line()
Bar                geom_bar(), geom_col()
Histogram          geom_histogram()
Regression         geom_smooth()
Boxplot            geom_boxplot()
Text               geom_text()
Vert./Horiz. Line  geom_{vh}line()
Count              geom_count()
Density            geom_density()

https://eric.netlify.com/2017/08/10/most-popular-ggplot2-geoms/


See http://ggplot2.tidyverse.org/reference/ for many more options

##  [1] "geom_abline"     "geom_area"       "geom_bar"       
##  [4] "geom_bin2d"      "geom_blank"      "geom_boxplot"   
##  [7] "geom_col"        "geom_contour"    "geom_count"     
## [10] "geom_crossbar"   "geom_curve"      "geom_density"   
## [13] "geom_density_2d" "geom_density2d"  "geom_dotplot"   
## [16] "geom_errorbar"   "geom_errorbarh"  "geom_freqpoly"  
## [19] "geom_hex"        "geom_histogram"  "geom_hline"     
## [22] "geom_jitter"     "geom_label"      "geom_line"      
## [25] "geom_linerange"  "geom_map"        "geom_path"      
## [28] "geom_point"      "geom_pointrange" "geom_polygon"   
## [31] "geom_qq"         "geom_qq_line"    "geom_quantile"  
## [34] "geom_raster"     "geom_rect"       "geom_ribbon"    
## [37] "geom_rug"        "geom_segment"    "geom_sf"        
## [40] "geom_sf_label"   "geom_sf_text"    "geom_smooth"    
## [43] "geom_spoke"      "geom_step"       "geom_text"      
## [46] "geom_tile"       "geom_violin"     "geom_vline"

Or just start typing geom_ in RStudio

Our First Plot!

ggplot(tidy_pop)

ggplot(tidy_pop) +
  aes(x = year, #<<
      y = pop) #<<

ggplot(tidy_pop) +
  aes(x = year,
      y = pop) +
  geom_point() #<<

ggplot(tidy_pop) +
  aes(x = year,
      y = pop,
      color = country) + #<<
  geom_point()

ggplot(tidy_pop) +
  aes(x = year,
      y = pop,
      color = country) + 
  geom_point() +
  geom_line() #<<

## geom_path: Each group consists of only one observation.
## Do you need to adjust the group aesthetic?

ggplot(tidy_pop) +
  aes(x = year,
      y = pop,
      color = country) +
  geom_point() +
  geom_line(
    aes(group = country)) #<<

g <- ggplot(tidy_pop) +
  aes(x = year,
      y = pop,
      color = country) +
  geom_point() +
  geom_line(
    aes(group = country))

g
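The message appears because year is a character column: on a discrete x-axis, ggplot2 treats every year/country combination as its own group. One alternative, sketched here under the assumption that a continuous x-axis is acceptable, is to convert year to a number so that color = country alone defines the groups. The names tidy_pop_chr, tidy_pop_num and g_num are made up for this example, and the data is rebuilt from the table shown earlier.

```r
library(ggplot2)

# Stand-in for tidy_pop; year is character, as it is after gather()
tidy_pop_chr <- data.frame(
  country = rep(c("Canada", "China", "United States"), times = 3),
  year    = rep(c("1997", "2002", "2007"), each = 3),
  pop     = c(30.30584, 1230.07500, 272.91176,
              31.90227, 1280.40000, 287.67553,
              33.39014, 1318.68310, 301.13995)
)

# Convert year to an integer so the x-axis becomes continuous
tidy_pop_num <- transform(tidy_pop_chr, year = as.integer(year))

g_num <- ggplot(tidy_pop_num) +
  aes(x = year, y = pop, color = country) +
  geom_point() +
  geom_line()   # no explicit group: one line per country
```

The scale_x_discrete() example later in these slides relies on year staying a character column, so the rest of the tutorial keeps the aes(group = country) version.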

Geometric Objects

geom_*(mapping, data, stat, position)
  • data Geoms can have their own data
    • Has to map onto global coordinates
  • mapping Geoms can have their own aesthetics
    • Inherits global aesthetics
    • Have geom-specific aesthetics
      • geom_point needs x and y, optional shape, color, size, etc.
      • geom_ribbon requires x, ymin and ymax, optional fill
    • ?geom_ribbon

geom_*(mapping, data, stat, position)
  • stat Some geoms apply further transformations to the data
    • All respect stat = 'identity'
    • Ex: geom_histogram uses stat_bin() to group observations
  • position Some adjust location of objects
    • 'dodge', 'stack', 'jitter'
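A small sketch of a layer with its own data and mapping arguments. It assumes the gapminder package is installed; gap_2007, canada_2007 and p_layers are names invented for this example.

```r
library(ggplot2)
library(gapminder)

gap_2007    <- subset(gapminder, year == 2007)
canada_2007 <- subset(gap_2007, country == "Canada")  # data for one layer only

p_layers <- ggplot(gap_2007) +
  aes(x = gdpPercap, y = lifeExp) +   # global aesthetics, inherited by both layers
  geom_point(color = "grey60") +
  geom_text(
    data = canada_2007,               # layer-specific data
    aes(label = country),             # layer-specific aesthetic, added to x and y
    vjust = -1
  )
```

The text layer draws a single label because its data has one row, while the point layer still uses all of gap_2007.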

Facets

Data
Aesthetic Mapping
Geometric Objects
Facets

+ facet_wrap()

+ facet_grid()

Facets

g + facet_wrap(~ country)


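# rows ~ columns: the plotted data must contain both variables
# (tidy_pop as shown has no continent column)
g + facet_grid(continent ~ country)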
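For a facet_grid() call that runs as written, here is a sketch on the gapminder data, which has both a year and a continent column. The choice of the two years and the names gap_ends and p_grid are assumptions made for illustration.

```r
library(ggplot2)
library(gapminder)

# Keep the first and last survey years so the grid stays small
gap_ends <- subset(gapminder, year %in% c(1952, 2007))

p_grid <- ggplot(gap_ends) +
  aes(x = gdpPercap, y = lifeExp) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  facet_grid(year ~ continent)   # one row per year, one column per continent
```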

Coordinates

Data
Aesthetic Mapping
Geometric Objects
Facets
Coordinates

+ coord_*()

Coordinates

g + coord_flip()


g + coord_polar()

Scales

Data
Aesthetic Mapping
Geometric Objects
Facets
Coordinates
Scales

+ scale_*_*()

Scales

scale + _ + <aes> + _ + <type> + ()

What parameter do you want to adjust? → <aes>
What type is the parameter? → <type>

  • I want to change my discrete x-axis
    scale_x_discrete()
  • I want to change range of point sizes from continuous variable
    scale_size_continuous()
  • I want to rescale y-axis as log
    scale_y_log10()
  • I want to use a different color palette
    scale_fill_discrete()
    scale_color_manual()

g + scale_color_manual(values = c("peru", "pink", "plum"))


g + scale_y_log10()


g + scale_x_discrete(labels = c("MCMXCVII", "MMII", "MMVII"))
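The scale_size_continuous() case from the list above is not shown on the slides. A minimal sketch, assuming the gapminder package is installed; the range values and the name p_size are arbitrary choices for this example.

```r
library(ggplot2)
library(gapminder)

p_size <- ggplot(subset(gapminder, year == 2007)) +
  aes(x = gdpPercap, y = lifeExp, size = pop) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  # <aes> = size, <type> = continuous: smallest and largest point sizes
  scale_size_continuous(range = c(0.5, 8))
```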

Statistics

Data
Aesthetic Mapping
Geometric Objects
Facets
Coordinates
Scales
Statistics

stat_count()

stat_identity()

Statistics

stat_*() functions are rarely called explicitly. They are typically used through the geom_*() that depends on them: for example, geom_histogram() bins and counts observations with stat_bin().

ggplot(gapminder, aes(gdpPercap)) +
  geom_histogram(aes(y = after_stat(count)))

Note

geom_bar() uses stat_count() by default: it counts the number of cases at each x position.

geom_col() uses stat_identity(): it leaves the data as is.
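The difference can be sketched by drawing the same bars both ways. This assumes the gapminder package is installed; gap_2007, continent_counts, p_bar and p_col are names made up for the example.

```r
library(ggplot2)
library(gapminder)

gap_2007 <- subset(gapminder, year == 2007)

# geom_bar(): stat_count() tallies the rows per continent
p_bar <- ggplot(gap_2007) +
  aes(x = continent) +
  geom_bar()

# geom_col(): count first, then plot the values as they are
continent_counts <- as.data.frame(table(continent = gap_2007$continent))

p_col <- ggplot(continent_counts) +
  aes(x = continent, y = Freq) +
  geom_col()
```

Both plots should show the same bar heights; only the place where the counting happens differs.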

Labels

Data
Aesthetic Mapping
Geometric Objects
Facets
Coordinates
Scales
Statistics
Labels

+ labs()

Labels

g + labs(x = "Year", y = "Population")

Themes

Data
Aesthetic Mapping
Geometric Objects
Facets
Coordinates
Scales
Statistics
Labels
Themes

+ theme()

Themes

Change the appearance of plot decorations, i.e. things that aren’t mapped to data

A few “starter” themes ship with the package

  • g + theme_bw()
  • g + theme_dark()
  • g + theme_gray()
  • g + theme_light()
  • g + theme_minimal()

Huge number of parameters, grouped by plot area:

  • Global options: line, rect, text, title
  • axis: x-, y- or other axis title, ticks, lines
  • legend: Plot legends
  • panel: Actual plot area
  • plot: Whole image
  • strip: Facet labels

Theme options are supported by helper functions:

  • element_blank() removes the element
  • element_line()
  • element_rect()
  • element_text()

g + theme_bw()


g + theme_minimal() + theme(text = element_text(family = "sans"))


You can also set the theme globally with theme_set()

All plots will now use this theme!

my_theme <- theme_bw() +
  theme(
    text = element_text(family = "sans", size = 12),
    panel.border = element_rect(colour = 'grey80'), 
    panel.grid.minor = element_blank()
  )

theme_set(my_theme)

g


You may also alter certain aspects of the plot, in addition to the defaults set in theme_set(); in this case, the legend is moved to the bottom.

g + theme(legend.position = 'bottom')
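theme_set() invisibly returns the theme that was active before the call, which makes it possible to switch back later. A short sketch; old_theme is a name chosen for this example.

```r
library(ggplot2)

# Switch the global theme and keep the previous one
old_theme <- theme_set(theme_minimal())

# ... plots created here use theme_minimal() ...

# Restore the earlier global theme
theme_set(old_theme)
```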

Saving Your Work

To save your plot, use ggsave()

ggsave(
  filename = "my_plot.png",
  plot = g,
  width = 10,
  height = 8,
  dpi = 100,
  device = "png"
)
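width and height are in inches unless units says otherwise, so 10 x 8 at dpi = 100 should give an image of about 1000 x 800 pixels. A self-contained sketch that writes to a temporary directory, assuming the gapminder package is installed and a PNG device is available; p_demo and out_file are names made up for the example.

```r
library(ggplot2)
library(gapminder)

p_demo <- ggplot(subset(gapminder, year == 2007)) +
  aes(x = gdpPercap, y = lifeExp) +
  geom_point()

# Write into the session's temporary directory rather than the project folder
out_file <- file.path(tempdir(), "my_plot.png")

ggsave(
  filename = out_file,
  plot     = p_demo,
  width    = 10,     # inches (the default unit)
  height   = 8,
  dpi      = 100
)
```

If plot is omitted, ggsave() saves the last plot that was displayed.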

Your First Plot!

library(gapminder)

head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Let’s start with lifeExp vs gdpPercap

ggplot(gapminder) +
  aes(x = gdpPercap,
      y = lifeExp)

Add points…

ggplot(gapminder) +
  aes(x = gdpPercap,
      y = lifeExp) +
  geom_point() #<<

How can I tell countries apart?

ggplot(gapminder) +
  aes(x = gdpPercap,
      y = lifeExp,
      color = continent) + #<<
  geom_point()

GDP is squished together on the left

ggplot(gapminder) +
  aes(x = gdpPercap,
      y = lifeExp,
      color = continent) +
  geom_point() +
  scale_x_log10() #<<

Still lots of overlap in the countries…

ggplot(gapminder) +
  aes(x = gdpPercap,
      y = lifeExp,
      color = continent) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ continent) + #<<
  guides(color = "none")    #<<

No need for color legend thanks to facet titles

Lots of overplotting due to point size

ggplot(gapminder) +
  aes(x = gdpPercap,
      y = lifeExp,
      color = continent) +
  geom_point(size = 0.25) + #<<
  scale_x_log10() +
  facet_wrap(~ continent) +
  guides(color = "none")

Is there a trend?

ggplot(gapminder) +
  aes(x = gdpPercap,
      y = lifeExp,
      color = continent) +
  geom_line() + #<<
  geom_point(size = 0.25) +
  scale_x_log10() +
  facet_wrap(~ continent) +
  guides(color = "none")

Okay, that line just connected all of the points sequentially…

ggplot(gapminder) +
  aes(x = gdpPercap,
      y = lifeExp,
      color = continent) +
  geom_line(
    aes(group = country) #<<
  ) +
  geom_point(size = 0.25) +
  scale_x_log10() +
  facet_wrap(~ continent) +
  guides(color = "none")

We need time on x-axis!

ggplot(gapminder) +
  aes(x = year, #<<
      y = gdpPercap, #<<
      color = continent) +
  geom_line(
    aes(group = country)
  ) +
  geom_point(size = 0.25) +
  scale_y_log10() + #<<
  facet_wrap(~ continent) +
  guides(color = "none")

Can’t see x-axis labels, though

ggplot(gapminder) +
  aes(x = year,
      y = gdpPercap,
      color = continent) +
  geom_line(
    aes(group = country)
  ) +
  geom_point(size = 0.25) +
  scale_y_log10() +
  scale_x_continuous(breaks = #<<
    seq(1950, 2000, 25) #<<
  ) +                            #<<
  facet_wrap(~ continent) +
  guides(color = "none")

What about life expectancy?

ggplot(gapminder) +
  aes(x = year, 
      y = lifeExp, #<<
      color = continent) +
  geom_line(
    aes(group = country)
  ) +
  geom_point(size = 0.25) +
  #scale_y_log10() + #<<
  scale_x_continuous(breaks = 
    seq(1950, 2000, 25)
  ) +  
  facet_wrap(~ continent) +
  guides(color = "none")

Okay, let’s add a trend line

ggplot(gapminder) +
  aes(x = year, 
      y = lifeExp,
      color = continent) +
  geom_line(
    aes(group = country)
  ) +
  geom_point(size = 0.25) +
  geom_smooth() + #<<
  scale_x_continuous(breaks = 
    seq(1950, 2000, 25)
  ) +  
  facet_wrap(~ continent) +
  guides(color = "none")

De-emphasize individual countries

ggplot(gapminder) +
  aes(x = year, 
      y = lifeExp,
      color = continent) +
  geom_line(
    aes(group = country),
    color = "grey75" #<<
  ) +
  geom_point(size = 0.25) +
  geom_smooth() + 
  scale_x_continuous(breaks = 
    seq(1950, 2000, 25)
  ) +  
  facet_wrap(~ continent) +
  guides(color = "none")

Points are still in the way

ggplot(gapminder) +
  aes(x = year, 
      y = lifeExp,
      color = continent) +
  geom_line(
    aes(group = country),
    color = "grey75"
  ) +
  #geom_point(size = 0.25) + #<<
  geom_smooth() + 
  scale_x_continuous(breaks = 
    seq(1950, 2000, 25)
  ) +  
  facet_wrap(~ continent) +
  guides(color = "none")

Let’s compare continents

ggplot(gapminder) +
  aes(x = year, 
      y = lifeExp,
      color = continent) +
  geom_line(
    aes(group = country),
    color = "grey75"
  ) +
  geom_smooth() + 
  # scale_x_continuous(
  #   breaks = 
  #     seq(1950, 2000, 25)
  # ) +  
  # facet_wrap(~ continent) + #<<
  guides(color = "none")

Wait, what color is each continent?

ggplot(gapminder) +
  aes(x = year, 
      y = lifeExp,
      color = continent) +
  geom_line(
    aes(group = country),
    color = "grey75"
  ) +
  geom_smooth() + 
  theme( #<<
  legend.position = "bottom" #<<
  ) #<<

Let’s try the minimal theme

ggplot(gapminder) +
  aes(x = year, 
      y = lifeExp,
      color = continent) +
  geom_line(
    aes(group = country),
    color = "grey75"
  ) +
  geom_smooth() + 
  theme_minimal() + #<<
  theme(
  legend.position = "bottom"
  )

Fonts are kind of big

ggplot(gapminder) +
  aes(x = year, 
      y = lifeExp,
      color = continent) +
  geom_line(
    aes(group = country),
    color = "grey75"
  ) +
  geom_smooth() + 
  theme_minimal( 
    base_size = 8) + #<<
  theme(
  legend.position = "bottom"
  )

Cool, let’s switch gears

americas <- 
  gapminder %>% 
  filter(
    country %in% c(
      "United States",
      "Canada",
      "Mexico",
      "Ecuador"
    )
  )

Let’s look at four countries in more detail. How do their populations compare to each other?

americas
## # A tibble: 48 x 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## # ... with 38 more rows

ggplot(americas) +
  aes(
    x = year,
    y = pop
  ) +
  geom_col()

Yeah, but how many people are in each country?

ggplot(americas) +
  aes(
    x = year,
    y = pop,
    fill = country #<<
  ) +
  geom_col()

Bars are “stacked”, can we separate?

ggplot(americas) +
  aes(
    x = year,
    y = pop,
    fill = country
  ) +
  geom_col(
    position = "dodge" #<<
  )

position = "dodge" places objects next to each other instead of overlapping

What is scientific notation anyway?

ggplot(americas) +
  aes(
    x = year,
    y = pop / 10^6, #<<
    fill = country
  ) +
  geom_col(
    position = "dodge" 
  )

ggplot aesthetics can take expressions!

Might be easier to see countries individually

ggplot(americas) +
  aes(
    x = year,
    y = pop / 10^6,
    fill = country
  ) +
  geom_col(
    position = "dodge" 
  ) +
  facet_wrap(~ country) + #<<
  guides(fill = "none") #<<

Let range of y-axis vary in each plot

ggplot(americas) +
  aes(
    x = year,
    y = pop / 10^6,
    fill = country
  ) +
  geom_col(
    position = "dodge" 
  ) +
  facet_wrap(~ country,
    scales = "free_y") + #<<
  guides(fill = "none")

What about life expectancy again?

ggplot(americas) +
  aes(
    x = year,
    y = lifeExp, #<<
    fill = country
  ) +
  geom_col(
    position = "dodge" 
  ) +
  facet_wrap(~ country,
    scales = "free_y") +
  guides(fill = "none")

This should really be 📈…instead of 📊

ggplot(americas) +
  aes(
    x = year,
    y = lifeExp,
    fill = country
  ) +
  geom_line() + #<<
  facet_wrap(~ country,
    scales = "free_y") +
  guides(fill = "none")

📊 are filled 📈 are colored

ggplot(americas) +
  aes(
    x = year,
    y = lifeExp,
    color = country #<<
  ) +
  geom_line() +
  facet_wrap(~ country,
    scales = "free_y") +
  guides(color = "none") #<<

Altogether now!

ggplot(americas) +
  aes(
    x = year,
    y = lifeExp,
    color = country
  ) +
  geom_line()

Okay, changing gears again. What is the range of life expectancy in the Americas?

gapminder %>% 
  filter(
    continent == "Americas"
  ) %>% #<<
  ggplot() + #<<
  aes(
    x = year,
    y = lifeExp
  )

You can pipe into ggplot()!
Just watch for %>% changing to +
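The same pattern works with R's native pipe, assuming R 4.1 or later. A sketch using base subset() in place of dplyr::filter(), with gapminder installed; p_pipe is a name made up for the example.

```r
library(ggplot2)
library(gapminder)

p_pipe <- gapminder |>
  subset(continent == "Americas") |>   # data steps are joined with the pipe
  ggplot() +                           # plot layers are joined with +
  aes(x = year, y = lifeExp) +
  geom_point()
```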

Boxplot for life expectancy range

gapminder %>% 
  filter(
    continent == "Americas"
  ) %>%
  ggplot() +
  aes(
    x = year,
    y = lifeExp
  ) +
  geom_boxplot() #<<

Why not boxplots by year?

gapminder %>% 
  filter(
    continent == "Americas"
  ) %>%
  mutate( #<<
    year = factor(year) #<<
  ) %>%  #<<
  ggplot() +
  aes(
    x = year,
    y = lifeExp
  ) +
  geom_boxplot()

OK, what about global life expectancy?

gapminder %>% 
  # filter(
  #   continent == "Americas"
  # ) %>%
  mutate(
    year = factor(year)
  ) %>% 
  ggplot() +
  aes(
    x = year,
    y = lifeExp
  ) +
  geom_boxplot()

Can we have cute little boxplots for each continent?

gapminder %>% 
  mutate(
    year = factor(year)
  ) %>% 
  ggplot() +
  aes(
    x = year,
    y = lifeExp,
    fill = continent #<<
  ) +
  geom_boxplot()

Hard to read years, let’s rotate

gapminder %>% 
  mutate(
    year = factor(year)
  ) %>% 
  ggplot() +
  aes(
    x = year,
    y = lifeExp,
    fill = continent
  ) +
  geom_boxplot() +
  coord_flip() #<<

Use dplyr::mutate() to group by decade

gapminder %>% 
  mutate(
    decade = floor(year / 10), #<<
    decade = decade * 10,      #<<
    decade = factor(decade)      #<<
  ) %>% 
  ggplot() +
  aes(
    x = decade, #<<
    y = lifeExp,
    fill = continent
  ) +
  geom_boxplot() +
  coord_flip()
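The floor-and-multiply step can also be written with integer division, %/%. A quick base R check on a few of the survey years; years, decade_floor and decade_intdiv are names made up for the example.

```r
years <- c(1952, 1957, 1962, 1997, 2002, 2007)

# Version used in the mutate() call above
decade_floor <- floor(years / 10) * 10

# Same result with integer division
decade_intdiv <- years %/% 10 * 10

identical(decade_floor, decade_intdiv)
```

Inside mutate() this would read decade = factor(year %/% 10 * 10).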

Let’s hide Oceania…

g <- gapminder %>% 
  filter( #<<
    continent != "Oceania" #<<
  ) %>% #<<
  mutate(
    decade = floor(year / 10) * 10,
    decade = factor(decade)
  ) %>% 
  ggplot() +
  aes(
    x = decade,
    y = lifeExp,
    fill = continent
  ) +
  geom_boxplot() +
  coord_flip()

Labeling the plot

g +
  theme_minimal(8) +
  labs(
    y = "Life Expectancy",
    x = "Decade",
    fill = NULL,
    title = "Life Expectancy by Continent and Decade",
    caption = "gapminder.org"
  )

Note that x and y refer to the original aesthetics; coord_flip() is applied afterwards.

Remove the legend title by setting fill = NULL in labs().

Extra Resources

Stack Exchange

Google

ggplot2 Extensions

ggplot2 and beyond

Learn more

Noteworthy RStudio Add-Ins

Practice and Review

Fun Datasets

  • fivethirtyeight

  • nycflights13

  • ggplot2movies

Review

  • Slides and code on GitHub: <TODO!!!>

Credits

@grrrck
github.com/gadenbuie
Garrick Aden-Buie